Rank-frequency relation for Chinese characters
Abstract
We show that Zipf's law for Chinese characters holds perfectly for sufficiently short texts (a few thousand different characters). The scenario of its validity is similar to that of Zipf's law for words in short English texts. For long Chinese texts (or for mixtures of short Chinese texts), rank-frequency relations for Chinese characters display a two-layer, hierarchical structure that combines a Zipfian power-law regime for frequent characters (first layer) with an exponential-like regime for less frequent characters (second layer). For these two layers we provide different (though related) theoretical descriptions that include the range of low-frequency characters (hapax legomena). The comparative analysis of rank-frequency relations for Chinese characters versus English words illustrates the extent to which characters play the same role for Chinese writers as words do for those writing in alphabetic systems.
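As a rough illustration of the quantities the abstract refers to, the sketch below builds a character rank-frequency list, estimates a power-law exponent by ordinary least squares on log-log axes, and counts hapax legomena. The function names (`rank_frequency`, `fit_zipf_exponent`, `hapax_count`) are hypothetical, and the least-squares fit is a simple stand-in chosen for brevity; it is not the theoretical description used in the paper.

```python
import math
from collections import Counter


def rank_frequency(text):
    """Return normalized character frequencies, most frequent first.

    Assumption: only whitespace is dropped; punctuation is kept.
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return [c / total for _, c in counts.most_common()]


def fit_zipf_exponent(freqs, max_rank=None):
    """Estimate gamma in f_r ~ r**(-gamma) from the first max_rank ranks.

    Uses an ordinary least-squares line through (log r, log f_r).
    """
    if max_rank is not None:
        freqs = freqs[:max_rank]
    if len(freqs) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # negate the slope so a decaying law gives gamma > 0


def hapax_count(text):
    """Count characters that occur exactly once (hapax legomena)."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return sum(1 for c in counts.values() if c == 1)


if __name__ == "__main__":
    # Synthetic frequencies that follow f_r = c / r exactly.
    raw = [1.0 / r for r in range(1, 51)]
    norm = sum(raw)
    print(fit_zipf_exponent([f / norm for f in raw]))
```

Because a Python `str` is a sequence of Unicode code points, each Chinese character is counted as one symbol. Passing `max_rank` restricts the fit to the most frequent characters, which is one way to probe the first (power-law) layer separately from the tail.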
Similar resources
On the Ranking Property and Underlying Dynamics of Complex Systems
Ranking procedures are widely used to describe phenomena in many different fields of the social and natural sciences, e.g., sociology, economics, linguistics, demography, physics, biology, etc. This dissertation studies the ranking properties and underlying dynamics embedded in complex systems. In particular, we focused on the scores/prizes ranking in sports systems and the wo...
A Common Construction Pattern of English Words and Chinese Characters
Rankings are ubiquitous around the world. Here I investigate spatial ranking patterns of English Words and Chinese Characters, and reveal a common construction pattern related to phase separation. In detail, I analyze a list of different words in the English language, and find that the frequency of the number of letters per word linearly or nonlinearly decays over its rank in the frequency tabl...
Extension of Zipf's Law to Word and Character N-grams for English and Chinese
It is shown that for a large corpus, Zipf's law for both words in English and characters in Chinese does not hold for all ranks. The frequency falls below the frequency predicted by Zipf's law for English words for rank greater than about 5,000 and for Chinese characters for rank greater than about 1,000. However, when single words or characters are combined together with n-gram words or chara...
Encoding and Ranking Similar Chinese Characters
Automatically detecting similar Chinese characters is useful in many areas, such as building intelligent authoring tools (e.g., automatic multiple choice question generation) in the area of computer assisted language learning. Previous work on the computation of Chinese character similarity focused on detecting character glyph similarity while ignoring the importance of other character features,...
Maximum Entropy, Word-Frequency, Chinese Characters, and Multiple Meanings
The word-frequency distribution of a text written by an author is well accounted for by a maximum entropy distribution, the RGF (random group formation)-prediction. The RGF-distribution is completely determined by the a priori values of the total number of words in the text (M), the number of distinct words (N) and the number of repetitions of the most common word (k(max)). It is here shown tha...
Journal title:
- CoRR
Volume abs/1309.1536
Publication date 2013